| Author | MG |
| Date | Nov 28, 2023 |
| Description | This notebook documents the data cleaning process for the Building Permits dataset offered by Seattle Open Data. |
This notebook shows the cleaning process, so its results need to be read in context: some figures describe the initial data, which is corrected later. The actual analysis is described in a separate document. The final section tests certain corrections on the resulting dataframe.
The source of the data is here. The data is provided by the Seattle Department of Construction and Inspections of the City of Seattle (Seattle Open Data – Building Permits).
The original data is contained in a single csv file.
The data is provided by a government entity, and I assume it follows standards for collecting and publishing credible data. It is reasonable to expect that the agency that prepared the data, the Seattle Department of Construction and Inspections, has taken measures to minimize any bias in the data.
The data is available as a Public Domain dataset.
# To render the figures properly in HTML, I set pio.renderers.default below.
# Exporting via VS Code "Jupyter: Export to HTML" shows the figures
# properly, but also includes the code cells.
#
# To export externally and hide the code cells in the exported file, run:
# jupyter nbconvert --to html --no-input file.ipynb
import plotly.io as pio
pio.renderers.default='notebook'
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# for maps
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Instead of setting the cell to Markdown, create Markdown from within a code cell!
# We can just use Python variable replacement syntax to make the text dynamic
from IPython.display import Markdown as md
# for file manipulation
import os.path
import myutils as ut
# show all rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# do not limit the column width
pd.set_option('display.max_colwidth', None)
# setup for seaborn plots
sns.set_style('darkgrid')
sns.set_palette('Set2')
# the style of the map
MAP_STYLE = "open-street-map"
#MAP_STYLE = "stamen-terrain"
# the path to the original data
DATA_PATH_ORG='input/Building_Permits_20231024_saved.zip'
# the path to the processed data
DATA_PATH_PROCESSED='input/Building_Permits_20231024_clean.zip'
# the path to the description of the fields
DESC_PATH_ORG='input/csv_descr.csv'
# temporary file to work with
DATA_PATH_TEMP='input/Building_Permits_20231024_temp.csv'
The short description of the dataset is presented below. There are 25 columns and 168018 entries.
Pandas warns that columns 6 and 10 have mixed types.
I added the parameter thousands=',' to the pd.read_csv call; otherwise it complains that column 6 has mixed values.
To resolve the mixed-types warning for column 10, I converted it explicitly to a date type. Since there are other date columns, I also specify them explicitly when reading the file. The results are presented below.
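The read step amounts to the sketch below. Note that ut.read_to_df is the notebook's own helper from myutils, whose implementation is not shown here, so this is an assumed illustration of the relevant pd.read_csv parameters, not the helper's actual code:

```python
import io
import pandas as pd

DATE_COLS = ['AppliedDate', 'IssuedDate', 'ExpiresDate', 'CompletedDate']

def read_permits(path_or_buf):
    # thousands=',' lets values like "1,014" parse as numbers instead of
    # mixed strings; parse_dates avoids the mixed-type warning on the
    # date columns by converting them up front
    return pd.read_csv(path_or_buf, thousands=',', parse_dates=DATE_COLS)

# tiny inline example with a thousands separator and a date column
csv = io.StringIO(
    "HousingUnits,AppliedDate,IssuedDate,ExpiresDate,CompletedDate\n"
    '"1,014",2021-12-08,,,\n'
)
df = read_permits(csv)
assert df['HousingUnits'].iloc[0] == 1014
assert str(df['AppliedDate'].dtype) == 'datetime64[ns]'
```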
df_org = ut.read_to_df(DATA_PATH_ORG, False)
df_org.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168018 entries, 0 to 168017
Data columns (total 25 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   PermitNum              168018 non-null  object
 1   PermitClass            161578 non-null  object
 2   PermitClassMapped      161578 non-null  object
 3   PermitTypeMapped       168018 non-null  object
 4   PermitTypeDesc         157825 non-null  object
 5   Description            167731 non-null  object
 6   HousingUnits           26697 non-null   float64
 7   HousingUnitsRemoved    79331 non-null   float64
 8   HousingUnitsAdded      79331 non-null   float64
 9   EstProjectCost         133261 non-null  float64
 10  AppliedDate            129287 non-null  datetime64[ns]
 11  IssuedDate             122470 non-null  datetime64[ns]
 12  ExpiresDate            122626 non-null  datetime64[ns]
 13  CompletedDate          84417 non-null   datetime64[ns]
 14  StatusCurrent          168018 non-null  object
 15  RelatedMup             13689 non-null   object
 16  OriginalAddress1       167581 non-null  object
 17  OriginalCity           161380 non-null  object
 18  OriginalState          161470 non-null  object
 19  OriginalZip            157323 non-null  object
 20  ContractorCompanyName  28358 non-null   object
 21  Link                   168018 non-null  object
 22  Latitude               165229 non-null  float64
 23  Longitude              165229 non-null  float64
 24  Location1              165229 non-null  object
dtypes: datetime64[ns](4), float64(6), object(15)
memory usage: 32.0+ MB
The description of attributes is presented below:
df_desc_org = pd.read_csv(DESC_PATH_ORG, sep='|', header=None)
df_desc_org
| 0 | 1 | |
|---|---|---|
| 0 | PermitNum | The tracking number used to refer to this permit in SDCI's tracking system. |
| 1 | PermitClass | The permit class tells you the type of project.\t |
| 2 | PermitClassMapped | A description of whether the permit is for a residential or non-residential project. |
| 3 | PermitTypeMapped | The permit type by category, such as building, demolition, roofing, grading, and environmentally critical areas. |
| 4 | PermitTypeDesc | Additional information about the type of permit. For example, whether it is an addition/alternation or a new project. |
| 5 | Description | A brief description of the work that will be done under this permit. This description is subject to change before SDCI issues the permit. The description is generally more stable if we have issued the permit. Very long descriptions have been truncated. |
| 6 | HousingUnits | The number of housing units included at the beginning of the project. |
| 7 | HousingUnitsRemoved | The number of housing units removed during the project. |
| 8 | HousingUnitsAdded | The number of housing units added during the project. |
| 9 | EstProjectCost | The estimated project cost of the work being proposed is based on fair market value (parts plus labor). The estimated cost (if any) represents the best available information to date, and is subject to change if the project is modified. We do not collect the estimated project cost for all permit types. |
| 10 | AppliedDate | The date SDCI accepted the application as a complete submittal. |
| 11 | IssuedDate | The date SDCI issued the permit. If there is an Application Date but no Issue Date, this generally means the application is still under review. |
| 12 | ExpiresDate | The date the application is due to expire. Generally, this is the date by which work is supposed to be completed (barring renewals or further extensions). If there is not an Expiration Date, this generally means the permit has not been issued. |
| 13 | CompletedDate | The date the permit had all its inspections completed. If there is an Issue Date but not a Completed Date, this generally means the permit is still under inspection. |
| 14 | StatusCurrent | The current status in the application/review/inspection life cycle. This status shows the most recent process step that was fully completed. |
| 15 | RelatedMup | The land use permit that is related to this building permit, if there is one. |
| 16 | OriginalAddress1 | The street name and number of the project. |
| 17 | OriginalCity | The city for the project's address. |
| 18 | OriginalState | The state for the project's address. |
| 19 | OriginalZip | The Zip code for the project's address. |
| 20 | ContractorCompanyName | The contractor(s) associated with this permit. |
| 21 | Link | A link to view full details and current status information about this permit at SDCI's website. |
| 22 | Latitude | Latitude of the worksite where permit activity occurs. May be missing for a small number of permits considered "unaddressable." |
| 23 | Longitude | Longitude of the worksite where permit activity occurs. May be missing for a small number of permits considered "unaddressable." |
| 24 | Location1 | The latitude and longitude location for mapping purposes. |
Let's look at the data types, comparing them with the data types shown on the data source website. The main discrepancy is that pandas automatically classifies attributes as object whereas the website defines them as Text. There are also attributes classified as float64 that might hold just integer values. Specifically:

- The text attributes (PermitNum, PermitClass, Description, and so on) are classified as object; the website classifies them as Text.
- EstProjectCost is classified as float64; the website classifies it as Number.
- The date attributes (AppliedDate, IssuedDate, ExpiresDate, CompletedDate) would be classified as object; the website classifies them as Floating Timestamp, so I convert them explicitly to datetime when reading the file.
- OriginalZip would be classified as float64; the website defines it as Text, so I read it as object to behave the same as the other object columns that should be strings (see later).
- Latitude and Longitude are classified as float64; the website classifies them as Number. Location1 is classified as object; it is Location on the website.
- The descriptive stats show that HousingUnits, HousingUnitsRemoved, and HousingUnitsAdded have missing values and otherwise only integer values, so they should be read as integers. However, a column with numbers and NaN values is automatically classified as float, and I cannot specify Int64 explicitly because pandas reports ValueError: Unable to parse string "1,014", so I need to convert those columns in a separate step. For now, float64 also supports these values, so I leave them as they are.
- I decided to leave Location1 as an object.
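The separate conversion step mentioned above can be sketched with pandas' nullable Int64 dtype, which keeps the missing values. This is a minimal illustration, not the notebook's actual helper:

```python
import pandas as pd
import numpy as np

# a column read as float64 because of NaN values (after thousands
# separators were already handled by read_csv) can be converted to the
# nullable Int64 dtype afterwards
s = pd.Series([1014.0, np.nan, 3.0])
s_int = s.astype('Int64')
assert str(s_int.dtype) == 'Int64'
assert s_int.iloc[0] == 1014
assert pd.isna(s_int.iloc[1])  # NaN survives as <NA>
```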
Let's find out what data is missing. RelatedMup, HousingUnits, and ContractorCompanyName have more than 80% of their values missing. RelatedMup is a related land use permit and may legitimately be absent.
The missing data may be a problem for particular business questions and should be addressed if needed.
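The helper ut.count_missing lives in myutils and its code is not shown here; a minimal sketch of what it computes (an assumed reimplementation for illustration) could look like this:

```python
import pandas as pd
import numpy as np

def count_missing(df):
    # per-column counts of present and missing values, plus the missing
    # percentage, in the same column layout as the table below
    total = len(df)
    not_null = df.notna().sum()
    missing = total - not_null
    return pd.DataFrame({
        'Attribute': df.columns,
        '#NotNull': not_null.values,
        '#Missing': missing.values,
        'Total': total,
        '%Missing': (missing / total * 100).round(2).values,
    })

# small usage example
df = pd.DataFrame({'a': [1, np.nan, 3, np.nan], 'b': [1, 2, 3, 4]})
out = count_missing(df)
assert out.loc[out['Attribute'] == 'a', '%Missing'].iloc[0] == 50.0
assert out.loc[out['Attribute'] == 'b', '#Missing'].iloc[0] == 0
```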
ut.count_missing(df_org).sort_values(['%Missing'], ascending=False)
| Attribute | #NotNull | #Missing | Total | %Missing | |
|---|---|---|---|---|---|
| 15 | RelatedMup | 13689 | 154329 | 168018 | 91.85 |
| 6 | HousingUnits | 26697 | 141321 | 168018 | 84.11 |
| 20 | ContractorCompanyName | 28358 | 139660 | 168018 | 83.12 |
| 7 | HousingUnitsRemoved | 79331 | 88687 | 168018 | 52.78 |
| 8 | HousingUnitsAdded | 79331 | 88687 | 168018 | 52.78 |
| 13 | CompletedDate | 84417 | 83601 | 168018 | 49.76 |
| 11 | IssuedDate | 122470 | 45548 | 168018 | 27.11 |
| 12 | ExpiresDate | 122626 | 45392 | 168018 | 27.02 |
| 10 | AppliedDate | 129287 | 38731 | 168018 | 23.05 |
| 9 | EstProjectCost | 133261 | 34757 | 168018 | 20.69 |
| 19 | OriginalZip | 157323 | 10695 | 168018 | 6.37 |
| 4 | PermitTypeDesc | 157825 | 10193 | 168018 | 6.07 |
| 17 | OriginalCity | 161380 | 6638 | 168018 | 3.95 |
| 18 | OriginalState | 161470 | 6548 | 168018 | 3.90 |
| 1 | PermitClass | 161578 | 6440 | 168018 | 3.83 |
| 2 | PermitClassMapped | 161578 | 6440 | 168018 | 3.83 |
| 22 | Latitude | 165229 | 2789 | 168018 | 1.66 |
| 23 | Longitude | 165229 | 2789 | 168018 | 1.66 |
| 24 | Location1 | 165229 | 2789 | 168018 | 1.66 |
| 16 | OriginalAddress1 | 167581 | 437 | 168018 | 0.26 |
| 5 | Description | 167731 | 287 | 168018 | 0.17 |
| 14 | StatusCurrent | 168018 | 0 | 168018 | 0.00 |
| 21 | Link | 168018 | 0 | 168018 | 0.00 |
| 3 | PermitTypeMapped | 168018 | 0 | 168018 | 0.00 |
| 0 | PermitNum | 168018 | 0 | 168018 | 0.00 |
The descriptive stats show some interesting facts that I will take a closer look at.
df_org.describe(include='all')
| PermitNum | PermitClass | PermitClassMapped | PermitTypeMapped | PermitTypeDesc | Description | HousingUnits | HousingUnitsRemoved | HousingUnitsAdded | EstProjectCost | AppliedDate | IssuedDate | ExpiresDate | CompletedDate | StatusCurrent | RelatedMup | OriginalAddress1 | OriginalCity | OriginalState | OriginalZip | ContractorCompanyName | Link | Latitude | Longitude | Location1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 168018 | 161578 | 161578 | 168018 | 157825 | 167731 | 26697.000000 | 79331.000000 | 79331.000000 | 1.332610e+05 | 129287 | 122470 | 122626 | 84417 | 168018 | 13689 | 167581 | 161380 | 161470 | 157323 | 28358 | 168018 | 165229.000000 | 165229.000000 | 165229 |
| unique | 168017 | 6 | 2 | 5 | 18 | 141033 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24 | 7283 | 78687 | 6 | 2 | 46 | 5992 | 168017 | NaN | NaN | 72949 |
| top | 6870287-CN | Single Family/Duplex | Residential | Building | Addition/Alteration | Construct addition and alterations to existing single family residence, per plan. | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Completed | 3012843-LU | 701 5TH AVE | SEATTLE | WA | 98103 | U OF W BUILDING PERMIT | https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6870287-CN | NaN | NaN | (47.60454088, -122.33073506) |
| freq | 2 | 95813 | 115278 | 135499 | 98111 | 390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 116817 | 32 | 429 | 161333 | 161469 | 11078 | 595 | 2 | NaN | NaN | 428 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | 38.934000 | 0.209792 | 3.057103 | 1.275588e+06 | 2014-09-24 15:00:37.841391872 | 2014-10-29 00:25:35.122070784 | 2016-07-22 17:48:24.930438912 | 2014-07-29 06:30:20.533778944 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.625378 | -122.334198 | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 1986-04-28 00:00:00 | 1986-07-03 00:00:00 | 2001-10-24 00:00:00 | 2005-02-22 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.495809 | -122.430368 | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 1.506400e+04 | 2010-04-02 00:00:00 | 2010-05-06 00:00:00 | 2012-01-19 00:00:00 | 2010-06-17 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.582701 | -122.363300 | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 6.500000e+04 | 2015-03-20 00:00:00 | 2015-04-02 00:00:00 | 2016-12-08 00:00:00 | 2014-10-21 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.627543 | -122.333558 | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 0.000000 | 1.000000 | 2.543950e+05 | 2019-01-16 00:00:00 | 2019-03-04 00:00:00 | 2020-12-04 00:00:00 | 2018-02-05 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.670168 | -122.304099 | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | 1000000.000000 | 272.000000 | 891.000000 | 2.500000e+10 | 2023-10-20 00:00:00 | 2023-10-20 00:00:00 | 2025-05-18 00:00:00 | 2023-10-20 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.733972 | -122.221100 | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | 6120.246962 | 2.129944 | 23.233911 | 9.736551e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.055536 | 0.038203 | NaN |
The values of the categorical attributes are presented below. A few corrections are needed:

- PermitTypeDesc contains both NONE and <NA> values; there are 388 NONE values. I assume they are semantically equivalent to <NA>, so I change the NONE values to <NA>.
- OriginalCity has several spellings of Seattle (including an incorrect one, SEATLLE). I decided on the SEATTLE spelling.
- OriginalState has two spellings of WA. I decided on the WA spelling.
'''
Prints out unique values of specified columns
@param df_ A dataframe
@param cols_ The list of columns the unique values need to be printed out
@return None
'''
def print_uniq_vals(df_, cols_):
    for c in cols_:
        print(f"\n---------- {c}: {df_[c].unique()}")
print("**** Unique values for specific observations:")
print_uniq_vals(df_org, ['PermitClass', 'PermitClassMapped', 'PermitTypeMapped',
'PermitTypeDesc', 'StatusCurrent', 'OriginalCity',
'OriginalState'])
'''
Prints additional information to examine a column more closely
@param df_ A dataframe
@return None
'''
def print_extra_info(df_):
    print(f"The number of 'NONE' values in PermitTypeDesc: {len(df_[df_['PermitTypeDesc'] == 'NONE'])}")
print("**** Extra information about data:")
print_extra_info(df_org)
**** Unique values for specific observations:

---------- PermitClass: ['Industrial' 'Vacant Land' 'Single Family/Duplex' 'Multifamily' 'Commercial' 'Institutional' nan]

---------- PermitClassMapped: ['Non-Residential' 'Residential' nan]

---------- PermitTypeMapped: ['Building' 'Demolition' 'Grading' 'Roof' 'ECA and Shoreline Exemption/Street Improvement Exception Request']

---------- PermitTypeDesc: ['New' 'Demolition' 'Addition/Alteration' 'Tenant Improvment' nan 'Change of Use Only - No Construction' 'Deconstruction' 'Temporary' 'Relocation' 'Curb Cut' 'NONE' 'Environmentally Critical Area Exemption' 'Shoreline Exemption' 'Tenant Improvement Pre-Approval' 'Street Improvement Exception' 'Relief from Prohibition on Steep Slope' 'Modification to Submittal Requirements' 'Shoreline Permit Exemption' 'ECA Small Project Waiver']

---------- StatusCurrent: ['Application Completed' 'Awaiting Information' 'Reviews In Process' 'Ready for Issuance' 'Issued' 'Reviews Completed' 'Corrections Required' 'Completed' 'Corrections Submitted' 'Withdrawn' 'Inspections Completed' 'Denied' 'Additional Info Requested' 'Approved to Occupy' 'Expired' 'Phase Issued' 'Closed' 'Scheduled' 'Canceled' 'Active' 'Initiated' 'Ready for Intake' 'Scheduled and Submitted' 'Pending']

---------- OriginalCity: ['SEATTLE' nan 'Seattle' 'TUKWILA' 'KING COUNTY' 'seattle' 'SEATLLE']

---------- OriginalState: ['WA' nan 'wa']

**** Extra information about data:
The number of 'NONE' values in PermitTypeDesc: 388
The columns' unique values after corrections are presented below:
# replace 'NONE' with '<NA>' in PermitTypeDesc
df_org.loc[ df_org['PermitTypeDesc'] == 'NONE', ['PermitTypeDesc']] = np.nan
#len(df_org[ df_org['PermitTypeDesc'] == 'NONE' ])
# make uppercase letters for the city
df_org['OriginalCity'] = df_org['OriginalCity'].str.upper()
# fix the spelling of Seattle
df_org.loc[ df_org['OriginalCity'] == 'SEATLLE', ['OriginalCity']] = 'SEATTLE'
# make uppercase letters for the state
df_org['OriginalState'] = df_org['OriginalState'].str.upper()
#print_uniq_vals(df_org, ['PermitTypeDesc', 'StatusCurrent', 'OriginalCity', 'OriginalState'])
print_uniq_vals(df_org, ['PermitClass', 'PermitClassMapped', 'PermitTypeMapped',
'PermitTypeDesc', 'StatusCurrent', 'OriginalCity',
'OriginalState'])
---------- PermitClass: ['Industrial' 'Vacant Land' 'Single Family/Duplex' 'Multifamily' 'Commercial' 'Institutional' nan]

---------- PermitClassMapped: ['Non-Residential' 'Residential' nan]

---------- PermitTypeMapped: ['Building' 'Demolition' 'Grading' 'Roof' 'ECA and Shoreline Exemption/Street Improvement Exception Request']

---------- PermitTypeDesc: ['New' 'Demolition' 'Addition/Alteration' 'Tenant Improvment' nan 'Change of Use Only - No Construction' 'Deconstruction' 'Temporary' 'Relocation' 'Curb Cut' 'Environmentally Critical Area Exemption' 'Shoreline Exemption' 'Tenant Improvement Pre-Approval' 'Street Improvement Exception' 'Relief from Prohibition on Steep Slope' 'Modification to Submittal Requirements' 'Shoreline Permit Exemption' 'ECA Small Project Waiver']

---------- StatusCurrent: ['Application Completed' 'Awaiting Information' 'Reviews In Process' 'Ready for Issuance' 'Issued' 'Reviews Completed' 'Corrections Required' 'Completed' 'Corrections Submitted' 'Withdrawn' 'Inspections Completed' 'Denied' 'Additional Info Requested' 'Approved to Occupy' 'Expired' 'Phase Issued' 'Closed' 'Scheduled' 'Canceled' 'Active' 'Initiated' 'Ready for Intake' 'Scheduled and Submitted' 'Pending']

---------- OriginalCity: ['SEATTLE' nan 'TUKWILA' 'KING COUNTY']

---------- OriginalState: ['WA' nan]
I changed OriginalAddress1 to uppercase letters:
df_org['OriginalAddress1'] = df_org['OriginalAddress1'].str.upper()
df_org['OriginalAddress1'].describe()
count         167581
unique         76199
top      701 5TH AVE
freq             436
Name: OriginalAddress1, dtype: object
The describe output above shows that one PermitNum value is not unique; it is a duplicate. I examined the rows with that value, indices 11830 and 11831, and all other attributes have the same values as well. So the two rows are identical and one can be safely removed from the dataset.
PermitNum looks like the only identifier in this dataset and should be unique. No other duplicate rows are found.
'''
Prints duplicate rows for a given column
@param df_ a dataframe
@param col_ the name of the column to analyze
@return None
'''
def print_dup(df_, col_):
    #d_series = ut.find_duplicates(df_, [col_])
    d_series = df_.duplicated(subset=[col_])
    if d_series.any():
        # to print the unique rows instead, use df_.loc[~d_series]
        print(df_.loc[d_series])
print_dup(df_org, 'PermitNum')
PermitNum PermitClass PermitClassMapped PermitTypeMapped \
11831 6870287-CN Multifamily Residential Building
PermitTypeDesc \
11831 New
Description \
11831 Establish use as single family residence with 2 attached accessory dwelling units and construct 3-unit townhouse, per plans.
HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost \
11831 NaN NaN NaN 555755.0
AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent \
11831 2021-12-08 NaT NaT NaT Corrections Required
RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip \
11831 3038111-LU 8661 BEACON AVE S NaN WA NaN
ContractorCompanyName \
11831 NaN
Link \
11831 https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6870287-CN
Latitude Longitude Location1
11831 47.524625 -122.288077 (47.52462452, -122.28807667)
# rows with the duplicate PermitNum are identical
df_org.drop_duplicates(subset='PermitNum', keep='last', inplace=True)
df_org.describe(include='all')
| PermitNum | PermitClass | PermitClassMapped | PermitTypeMapped | PermitTypeDesc | Description | HousingUnits | HousingUnitsRemoved | HousingUnitsAdded | EstProjectCost | AppliedDate | IssuedDate | ExpiresDate | CompletedDate | StatusCurrent | RelatedMup | OriginalAddress1 | OriginalCity | OriginalState | OriginalZip | ContractorCompanyName | Link | Latitude | Longitude | Location1 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 168017 | 161577 | 161577 | 168017 | 157436 | 167730 | 26697.000000 | 79331.000000 | 79331.000000 | 1.332600e+05 | 129286 | 122470 | 122626 | 84417 | 168017 | 13688 | 167580 | 161380 | 161469 | 157323 | 28358 | 168017 | 165228.000000 | 165228.000000 | 165228 |
| unique | 168017 | 6 | 2 | 5 | 17 | 141033 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24 | 7283 | 76199 | 3 | 1 | 46 | 5992 | 168017 | NaN | NaN | 72949 |
| top | 6818110-CN | Single Family/Duplex | Residential | Building | Addition/Alteration | Construct addition and alterations to existing single family residence, per plan. | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Completed | 3012843-LU | 701 5TH AVE | SEATTLE | WA | 98103 | U OF W BUILDING PERMIT | https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6818110-CN | NaN | NaN | (47.60454088, -122.33073506) |
| freq | 1 | 95813 | 115277 | 135498 | 98111 | 390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 116817 | 32 | 436 | 161378 | 161469 | 11078 | 595 | 1 | NaN | NaN | 428 |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | 38.934000 | 0.209792 | 3.057103 | 1.275593e+06 | 2014-09-24 14:31:19.331095040 | 2014-10-29 00:25:35.122071040 | 2016-07-22 17:48:24.930439168 | 2014-07-29 06:30:20.533778688 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.625379 | -122.334198 | NaN |
| min | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 1986-04-28 00:00:00 | 1986-07-03 00:00:00 | 2001-10-24 00:00:00 | 2005-02-22 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.495809 | -122.430368 | NaN |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 1.506050e+04 | 2010-04-02 00:00:00 | 2010-05-06 00:00:00 | 2012-01-19 00:00:00 | 2010-06-17 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.582701 | -122.363300 | NaN |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | 0.000000 | 0.000000 | 0.000000 | 6.500000e+04 | 2015-03-20 00:00:00 | 2015-04-02 00:00:00 | 2016-12-08 00:00:00 | 2014-10-21 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.627546 | -122.333558 | NaN |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 0.000000 | 1.000000 | 2.543732e+05 | 2019-01-16 00:00:00 | 2019-03-04 00:00:00 | 2020-12-04 00:00:00 | 2018-02-05 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.670168 | -122.304104 | NaN |
| max | NaN | NaN | NaN | NaN | NaN | NaN | 1000000.000000 | 272.000000 | 891.000000 | 2.500000e+10 | 2023-10-20 00:00:00 | 2023-10-20 00:00:00 | 2025-05-18 00:00:00 | 2023-10-20 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47.733972 | -122.221100 | NaN |
| std | NaN | NaN | NaN | NaN | NaN | NaN | 6120.246962 | 2.129944 | 23.233911 | 9.736588e+07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.055535 | 0.038203 | NaN |
# check if there are any duplicates
print(f"Are there any duplicates in dataframe? Duplicates found:\n{ut.are_there_duplicates(df_org)}")
Are there any duplicates in dataframe? Duplicates found:
Series([], dtype: bool)
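The helper ut.are_there_duplicates is defined in myutils and its code is not shown here; a plausible sketch, consistent with the empty boolean Series it returned above, could be:

```python
import pandas as pd

def are_there_duplicates(df):
    # return the boolean mask restricted to the duplicated rows;
    # an empty Series means the dataframe has no fully duplicated rows
    mask = df.duplicated()
    return mask[mask]

# usage example on a toy dataframe with one duplicated row
df = pd.DataFrame({'PermitNum': ['A', 'B', 'B'], 'x': [1, 2, 2]})
dups = are_there_duplicates(df)
assert len(dups) == 1
assert are_there_duplicates(df.drop_duplicates()).empty
```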
md(f"The attribute *PermitNum* is a string {df_org.PermitNum.str.len().min()}"
f"-{df_org.PermitNum.str.len().max()} characters long.<BR>"
f"The attribute *Description* is {df_org.Description.str.len().min()}"
f"-{df_org.Description.str.len().max()} characters long.")
The attribute PermitNum is a string 10-14 characters long.
The attribute Description is 1.0-499.0 characters long.
# a few examples of *Description*
print(df_org[df_org.Description.str.len() < 5]['Description'].head(10))
30830     Demo
58837     Demo
108327    T.I.
130093    test
130370    stfi
130439     ULS
132367    TEST
132865     TI.
132892      TI
133381    VOID
Name: Description, dtype: object
# the record 6409645-CN has an implausible HousingUnits value (1000000);
# replace it with the corresponding HousingUnitsAdded value
def correct_housing_units(df_):
    idx = df_[df_['PermitNum'] == '6409645-CN'].index
    df_.loc[idx, 'HousingUnits'] = df_.loc[idx, 'HousingUnitsAdded']
correct_housing_units(df_org)
print(df_org[df_org['HousingUnits'] > 10000])
#print(df_org[df_org['HousingUnits'] > 500])
#print(df_org[df_org['HousingUnitsAdded'] > 500])
#print(df_org[df_org['EstProjectCost'] > 10000000].loc[:, ['PermitNum', 'PermitClass', 'HousingUnits', 'EstProjectCost']])
df_org.describe(include=[float])
# 'HousingUnitsRemoved','HousingUnitsAdded', 'EstProjectCost'].describe()
Empty DataFrame
Columns: [PermitNum, PermitClass, PermitClassMapped, PermitTypeMapped, PermitTypeDesc, Description, HousingUnits, HousingUnitsRemoved, HousingUnitsAdded, EstProjectCost, AppliedDate, IssuedDate, ExpiresDate, CompletedDate, StatusCurrent, RelatedMup, OriginalAddress1, OriginalCity, OriginalState, OriginalZip, ContractorCompanyName, Link, Latitude, Longitude, Location1]
Index: []
| HousingUnits | HousingUnitsRemoved | HousingUnitsAdded | EstProjectCost | Latitude | Longitude | |
|---|---|---|---|---|---|---|
| count | 26697.000000 | 79331.000000 | 79331.000000 | 1.332600e+05 | 165228.000000 | 165228.000000 |
| mean | 1.477432 | 0.209792 | 3.057103 | 1.275593e+06 | 47.625379 | -122.334198 |
| std | 11.886870 | 2.129944 | 23.233911 | 9.736588e+07 | 0.055535 | 0.038203 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 47.495809 | -122.430368 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 1.506050e+04 | 47.582701 | -122.363300 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 6.500000e+04 | 47.627546 | -122.333558 |
| 75% | 1.000000 | 0.000000 | 1.000000 | 2.543732e+05 | 47.670168 | -122.304104 |
| max | 1014.000000 | 272.000000 | 891.000000 | 2.500000e+10 | 47.733972 | -122.221100 |
'''
Prints some info about the project cost
@param df_ the dataframe
@param min_cost_ the minimum cost above which a project is considered expensive
@return None
'''
def print_info_EstProjectCost(df_, min_cost_):
    labels = ['<=100K', '(100K, 1M]', '(1M, 3M]', '(3M, 10M]', '>10M']
    df_['CostBinned'] = pd.cut(
        df_['EstProjectCost'],
        [0, 1.0e+05, 1.0e+06, 3.0e+06, 10.0e+06, df_['EstProjectCost'].max()],
        labels=labels,
        include_lowest=True, ordered=True)
    # df_['CostBinned'] = pd.qcut(df_['EstProjectCost'], q=5)
    # how many projects in each project cost category
    s = df_['CostBinned'].value_counts()
    cost_binned_df = pd.DataFrame({'CostBinned': s.index, 'Count': s.values})
    print("The table that shows projects grouped by cost and the number of projects in each group:")
    print(cost_binned_df)
    # create a color palette; the colors on the two pie charts go in
    # reverse order, so using the same color for the same category makes
    # them less confusing
    color_map = {}
    for i in range(len(labels)):
        color_map[labels[i]] = px.colors.qualitative.Plotly[i]
    # without specifying 'color', the color_discrete_map is ignored
    px.pie(cost_binned_df, values='Count', names='CostBinned',
           title='Project Counts per Estimated Project Cost',
           color='CostBinned',  # the colors are taken from the unique values of this column
           color_discrete_map=color_map,
           hole=0.3).update_traces(
               textposition='inside', textinfo='percent+label',
           ).show()
    # specify the 'color' property if you want 'color_discrete_map' to apply
    px.pie(df_, values='EstProjectCost', names='CostBinned',
           hole=0.3, color='CostBinned', color_discrete_map=color_map,
           title='Estimated Project Cost per Total Value in a Group in US dollars.').show()
    # show the most expensive projects
    df_most_expensive = df_.loc[df_.EstProjectCost >= min_cost_,
                                ['PermitNum', 'PermitClass',
                                 'HousingUnitsAdded', 'EstProjectCost']]
    print(f"EstProjectCost >= {min_cost_}: {len(df_most_expensive)}.")
    print("Most expensive projects sorted in descending order: ")
    print(df_most_expensive.sort_values(by='EstProjectCost', ascending=False).head(15))
'''
Boxplot of EstProjectCost grouped by permit class
@param df_ dataframe
@param max_cost_ the maximum cost that should be considered
@return Plotly figure
'''
def boxplot_EstProjectCost(df_, max_cost_):
    fig = px.box(df_[df_.EstProjectCost <= max_cost_],
                 x='PermitClass', y='EstProjectCost', points='all',
                 title=f"Estimated Project Cost with max cost <= {max_cost_} US dollars.")
    return fig
print_info_EstProjectCost(df_org, 0.5e+08)
The table that shows projects grouped by cost and the number of projects in each group:
   CostBinned  Count
0      <=100K  79021
1  (100K, 1M]  45935
2    (1M, 3M]   4358
3   (3M, 10M]   2007
4        >10M   1939
EstProjectCost >= 50000000.0: 475.
Most expensive projects sorted in descending order:
PermitNum PermitClass HousingUnitsAdded EstProjectCost
167552 6973064-DM Multifamily 52.0 2.500000e+10
167551 6973063-CN Multifamily 52.0 2.500000e+10
166997 6961341-CN Multifamily NaN 1.560000e+09
155573 6830099-CN Institutional NaN 1.443000e+09
152116 6781193-CN Multifamily NaN 1.000000e+09
152117 6781194-DM Multifamily NaN 1.000000e+09
164007 6879594-EX Institutional NaN 5.000000e+08
33821 6694765-PH Commercial 0.0 4.587680e+08
22117 6706025-PH Commercial 891.0 3.706000e+08
49 6973996-PH Multifamily NaN 3.210309e+08
56609 6438159-PH Commercial 0.0 2.922400e+08
167795 6977429-CN Institutional NaN 2.750000e+08
157280 6744090-CN Commercial 400.0 2.750000e+08
157281 6744091-DM Commercial 400.0 2.750000e+08
49910 6436184-PH Commercial 208.0 2.709659e+08
boxplot_EstProjectCost(df_org, 0.1e+08).show()
There are 23-27% missing values for all dates but CompletedDate which lacks almost 50% of values.
'''
It prints info about the dates in the dataframe
@param df_ The data frame
@return None
'''
def print_dates(df_):
    dates = df_[['AppliedDate', 'ExpiresDate', 'IssuedDate', 'CompletedDate']]
    print(ut.count_missing(dates))
    print(f"len(df_) = {len(df_)}, len(dates)={len(dates)}")
    print("Descriptive stats regarding the dates:")
    print(dates.describe())
    print(dates.head())
# show some info about the dates
print_dates(df_org)
Attribute #NotNull #Missing Total %Missing
0 AppliedDate 129286 38731 168017 23.05
1 ExpiresDate 122626 45391 168017 27.02
2 IssuedDate 122470 45547 168017 27.11
3 CompletedDate 84417 83600 168017 49.76
len(df_) = 168017, len(dates)=168017
Descriptive stats regarding the dates:
AppliedDate ExpiresDate \
count 129286 122626
mean 2014-09-24 14:31:19.331095040 2016-07-22 17:48:24.930439168
min 1986-04-28 00:00:00 2001-10-24 00:00:00
25% 2010-04-02 00:00:00 2012-01-19 00:00:00
50% 2015-03-20 00:00:00 2016-12-08 00:00:00
75% 2019-01-16 00:00:00 2020-12-04 00:00:00
max 2023-10-20 00:00:00 2025-05-18 00:00:00
IssuedDate CompletedDate
count 122470 84417
mean 2014-10-29 00:25:35.122071040 2014-07-29 06:30:20.533778688
min 1986-07-03 00:00:00 2005-02-22 00:00:00
25% 2010-05-06 00:00:00 2010-06-17 00:00:00
50% 2015-04-02 00:00:00 2014-10-21 00:00:00
75% 2019-03-04 00:00:00 2018-02-05 00:00:00
max 2023-10-20 00:00:00 2023-10-20 00:00:00
AppliedDate ExpiresDate IssuedDate CompletedDate
0 2023-10-20 NaT NaT NaT
1 2023-10-20 NaT NaT NaT
2 2023-10-20 NaT NaT NaT
3 2023-10-20 NaT NaT NaT
4 2023-10-20 NaT NaT NaT
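One further date-quality check, my own addition and not part of the original cleaning, would be to flag permits whose ExpiresDate precedes IssuedDate. A minimal sketch on synthetic data, assuming the columns are already datetime64:

```python
import pandas as pd

def flag_inverted_dates(df_):
    """Return rows where ExpiresDate is earlier than IssuedDate (both present)."""
    mask = (df_['ExpiresDate'].notna() & df_['IssuedDate'].notna()
            & (df_['ExpiresDate'] < df_['IssuedDate']))
    return df_[mask]

# tiny synthetic example; the second row has an inverted pair of dates
demo = pd.DataFrame({
    'IssuedDate':  pd.to_datetime(['2018-05-02', '2019-01-10', None]),
    'ExpiresDate': pd.to_datetime(['2019-11-02', '2018-12-31', '2020-01-01']),
})
bad = flag_inverted_dates(demo)
print(len(bad))  # 1
```

Rows with a missing date on either side are excluded, so the check only fires on genuinely inconsistent pairs.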
md(f"<ul><li>The attribute <i>OriginalAddress1</i> is a string {df_org.OriginalAddress1.str.len().min()}"
   f"-{df_org.OriginalAddress1.str.len().max()} characters long.</li>"
   f"<li>The attribute <i>RelatedMup</i> is {df_org.RelatedMup.str.len().min()}"
   f"-{df_org.RelatedMup.str.len().max()} characters long.</li>"
   f"<li>The attribute <i>Link</i> is {df_org.Link.str.len().min()}"
   f"-{df_org.Link.str.len().max()} characters long.</li>"
   f"<li>The attribute <i>ContractorCompanyName</i> is {df_org.ContractorCompanyName.str.len().min()}"
   f"-{df_org.ContractorCompanyName.str.len().max()} characters long.</li>"
   f"</ul>"
)
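The repeated min/max length reporting in the cell above could be factored into a small helper. This is a sketch; `str_len_range` is a hypothetical name, not part of myutils:

```python
import pandas as pd

def str_len_range(df_, cols_):
    """Return a DataFrame with the min and max string length per column,
    ignoring missing values."""
    rows = []
    for c in cols_:
        lens = df_[c].dropna().str.len()
        rows.append({'Attribute': c, 'MinLen': lens.min(), 'MaxLen': lens.max()})
    return pd.DataFrame(rows)

# synthetic example with one missing value
demo = pd.DataFrame({'OriginalAddress1': ['1516 23RD AVE', '120 NEWTON ST', None]})
print(str_len_range(demo, ['OriginalAddress1']))
```

This keeps the reporting in one place instead of repeating the `str.len().min()`/`str.len().max()` pairs per attribute.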
Below, histograms are presented to get a better insight into the data. We can see that OriginalCity and OriginalZip are worth a closer look.
'''
Show a histogram for each of the given columns
@param df_ the dataframe
@param x_ a list of column names to plot
@return None
'''
def show_hist(df_, x_):
for x in x_:
fig = px.histogram(df_, x = x, text_auto=True)
fig.show()
'''
Show a box plot for each of the given columns in a dataframe
@param df_ the dataframe
@param x_ a list of column names to plot
@return None
'''
def show_box_plots(df_, x_):
for x in x_:
fig = px.box(df_, x = x)
fig.show()
show_hist(df_org,
['PermitClass', 'PermitClassMapped', 'PermitTypeMapped',
'PermitTypeDesc', 'StatusCurrent', 'RelatedMup','OriginalCity',
'OriginalZip'])
px.box(df_org, y = ['HousingUnitsRemoved', 'HousingUnitsAdded'], points='all')
#show_box_plots(df_org,
# ['HousingUnits', 'HousingUnitsRemoved',
# 'HousingUnitsAdded', 'EstProjectCost'])
There is only one permit for TUKWILA, whose StatusCurrent is Awaiting Information. Searching for permit number 6915033-CN at Permit and Property Records, the documents Construction Permit in Record Snapshot.pdf and Geotechnical Report.pdf give addresses in Seattle, not Tukwila: 3201 S NORFOLK ST SEATTLE, WA 98118 and 3301 South Norfolk Street in Seattle, respectively. Since the two documents disagree on the exact project location, I use the address from the Construction Permit record.
There is also one permit for KING COUNTY that was Withdrawn. Searching Permit and Property Records for the address 8819 S 116TH PL shows SEATTLE as the project city, and there is paperwork for the bathroom. So I changed the city from KING COUNTY to SEATTLE.
ut.count_missing(df_org[['OriginalCity', 'OriginalZip']])
| Attribute | #NotNull | #Missing | Total | %Missing | |
|---|---|---|---|---|---|
| 0 | OriginalCity | 161380 | 6637 | 168017 | 3.95 |
| 1 | OriginalZip | 157323 | 10694 | 168017 | 6.36 |
df_org[df_org['OriginalCity'] == 'TUKWILA']
| PermitNum | PermitClass | PermitClassMapped | PermitTypeMapped | PermitTypeDesc | Description | HousingUnits | HousingUnitsRemoved | HousingUnitsAdded | EstProjectCost | AppliedDate | IssuedDate | ExpiresDate | CompletedDate | StatusCurrent | RelatedMup | OriginalAddress1 | OriginalCity | OriginalState | OriginalZip | ContractorCompanyName | Link | Latitude | Longitude | Location1 | CostBinned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5708 | 6915033-CN | Commercial | Non-Residential | Building | New | Construct new commercial building (Prologis building 4), portion of building in City of Tukwila, occupy per plan. Mechanical is included. | 0.0 | 0.0 | 0.0 | 1732632.0 | 2022-10-26 | NaT | NaT | NaT | Awaiting Information | NaN | 10300 EAST MARGINAL WAY S | TUKWILA | WA | 98168 | NaN | https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6915033-CN | 47.510472 | -122.291367 | (47.51047209, -122.29136723) | (1M, 3M] |
df_org[df_org['OriginalCity'] == 'KING COUNTY']
| PermitNum | PermitClass | PermitClassMapped | PermitTypeMapped | PermitTypeDesc | Description | HousingUnits | HousingUnitsRemoved | HousingUnitsAdded | EstProjectCost | AppliedDate | IssuedDate | ExpiresDate | CompletedDate | StatusCurrent | RelatedMup | OriginalAddress1 | OriginalCity | OriginalState | OriginalZip | ContractorCompanyName | Link | Latitude | Longitude | Location1 | CostBinned | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6258 | 6924436-CN | Single Family/Duplex | Residential | Building | Addition/Alteration | Interior alterations to bathroom in SFR, subject to field inspection, STFI. | NaN | NaN | NaN | 10500.0 | 2022-09-26 | 2022-09-26 | 2024-03-26 | NaT | Withdrawn | NaN | 8819 S 116TH PL | KING COUNTY | WA | 98178 | NaN | https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6924436-CN | 47.498304 | -122.2211 | (47.49830434, -122.22109991) | <=100K |
'''
Get the index of the record
@param df_ dataframe
@param permitnum_ a string representing a permit number
@return An index representing the row location of a permit number in a dataframe
'''
def get_idx(df_, permitnum_):
return df_[df_['PermitNum'] == permitnum_].index
# correct the Tukwila address to the Seattle address:
permit = '6915033-CN'
idx = get_idx(df_org, permit)
df_org.loc[idx, ['OriginalAddress1', 'OriginalCity', 'OriginalZip']] = ['3201 S NORFOLK ST', 'SEATTLE', '98118']
print('After the correction of the TUKWILA record:')
df_org.loc[get_idx(df_org, permit), ['PermitNum', 'OriginalAddress1', 'OriginalCity', 'OriginalZip']]
After the correction of the TUKWILA record:
| PermitNum | OriginalAddress1 | OriginalCity | OriginalZip | |
|---|---|---|---|---|
| 5708 | 6915033-CN | 3201 S NORFOLK ST | SEATTLE | 98118 |
permit = '6924436-CN'
df_org.loc[get_idx(df_org, permit), 'OriginalCity'] = 'SEATTLE'
print("After correcting OriginalCity from KING COUNTY to SEATTLE:")
df_org.loc[get_idx(df_org, permit), ['PermitNum', 'OriginalAddress1', 'OriginalCity', 'OriginalZip']]
After correcting OriginalCity from KING COUNTY to SEATTLE:
| PermitNum | OriginalAddress1 | OriginalCity | OriginalZip | |
|---|---|---|---|---|
| 6258 | 6924436-CN | 8819 S 116TH PL | SEATTLE | 98178 |
md(f"<ul><li>The attribute <i>OriginalZip</i> is a string {df_org.OriginalZip.str.len().min()}"
f"-{df_org.OriginalZip.str.len().max()} characters long.</li>"
f"</ul>")
The OriginalZip column contains 24 zip codes that are not 5 characters long; I searched for each of them manually and then updated them. The address of the record 1615 23RD AVE (permit 6873274-CN) was not found in Seattle but in Longview. However, the address 1516 23rd Ave, Seattle, WA 98122 does exist in Seattle, so assuming a typo, I changed the record to this address.
'''
Print information about the zip codes
@param df_ the dataframe
@return the index of rows whose OriginalZip has fewer than 5 characters
'''
def print_zip_info(df_):
    print("Records with OriginalZip shorter than 5 characters:")
    mask = df_['OriginalZip'].str.len() < 5
    df_masked = df_[mask]
    print(df_masked[['PermitNum','OriginalAddress1', 'OriginalZip']])
    return df_masked.index
'''
Corrects the zip codes in place
@param df_ the dataframe
@param idx_ index of records that need their zip code updated
@return None (df_ is modified in place)
'''
def correct_zips(df_, idx_):
    zips = ['98107','98109','98199','98122','98101','98117','98117',
            '98119','98178','98144','98115','98122','98117','98117',
            '98105','98112','98112','98122','98117','98107','98107',
            '98134','98134','98118']
    assert len(idx_) == len(zips), 'The number of zip codes does not match the number of records to be corrected'
    # correct the mistyped address (see above)
    df_.loc[11718, 'OriginalAddress1'] = '1516 23RD AVE'
    # remember which rows had short zip codes, so they can be shown after the fix
    mask = df_['OriginalZip'].str.len() < 5
    for i in range(len(zips)):
        df_.loc[idx_[i], 'OriginalZip'] = zips[i]
    print("After correcting the zip codes:")
    print(df_[mask][['PermitNum', 'OriginalAddress1', 'OriginalZip']])
# find the zip codes and correct them
zip_idx = print_zip_info(df_org)
#print(zip_idx)
correct_zips(df_org, zip_idx)
Records with OriginalZip shorter than 5 characters:
PermitNum OriginalAddress1 OriginalZip
440 6968761-CN 6049 SEAVIEW AVE NW 0
6210 6880555-DM 120 NEWTON ST 74
7480 6873706-CN 3217 W RUFFNER ST 100
7868 6883278-DM 324 LAKE WASHINGTON BLVD 0
7999 6882054-CN 1942 WESTLAKE AVE 35
9032 6880515-DM 8338 27TH AVE NW 90
10047 6880514-CN 8338 27TH AVE NW 90
11012 6879171-DM 1617 3RD AVE W 76
11273 6875557-CN 10440 67TH AVE S 92
11640 6873831-CN 711 M L KING JR WAY S 83
11709 6863817-CN 3222 NE 96TH ST 46
11718 6873274-CN 1615 23RD AVE 78
11895 6869481-CN 7753 26TH AVE NW 90
11898 6871773-CN 7755 26TH AVE NW 90
12488 6790584-DM 3815 4TH AVE NE 70
20519 6783372-CN 2346 43RD AVE E 0
151790 6783022-EX 2346 43RD AVE E 0
163426 6884721-CN 603 35TH AVE 78
163979 6878586-CN 7324 19TH AVE NW 90
165251 6944323-CN 5619 SEAVIEW AVE NW 0
165806 6930076-EX 3459 NW 54TH ST 0
165873 6932038-CN 1140 SW MASSACHUSETTS ST 0
165954 6933589-EX 1140 SW MASSACHUSETTS ST 0
167654 6974838-CN 9700 RAINIER AVE S 0
After correcting the zip codes:
PermitNum OriginalAddress1 OriginalZip
440 6968761-CN 6049 SEAVIEW AVE NW 98107
6210 6880555-DM 120 NEWTON ST 98109
7480 6873706-CN 3217 W RUFFNER ST 98199
7868 6883278-DM 324 LAKE WASHINGTON BLVD 98122
7999 6882054-CN 1942 WESTLAKE AVE 98101
9032 6880515-DM 8338 27TH AVE NW 98117
10047 6880514-CN 8338 27TH AVE NW 98117
11012 6879171-DM 1617 3RD AVE W 98119
11273 6875557-CN 10440 67TH AVE S 98178
11640 6873831-CN 711 M L KING JR WAY S 98144
11709 6863817-CN 3222 NE 96TH ST 98115
11718 6873274-CN 1516 23RD AVE 98122
11895 6869481-CN 7753 26TH AVE NW 98117
11898 6871773-CN 7755 26TH AVE NW 98117
12488 6790584-DM 3815 4TH AVE NE 98105
20519 6783372-CN 2346 43RD AVE E 98112
151790 6783022-EX 2346 43RD AVE E 98112
163426 6884721-CN 603 35TH AVE 98122
163979 6878586-CN 7324 19TH AVE NW 98117
165251 6944323-CN 5619 SEAVIEW AVE NW 98107
165806 6930076-EX 3459 NW 54TH ST 98107
165873 6932038-CN 1140 SW MASSACHUSETTS ST 98134
165954 6933589-EX 1140 SW MASSACHUSETTS ST 98134
167654 6974838-CN 9700 RAINIER AVE S 98118
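As a follow-up check after these corrections, one could assert that every remaining non-missing zip code is exactly five digits. A sketch using a regex full match; this helper is my own addition, not part of the notebook's test suite:

```python
import pandas as pd

def all_zips_five_digits(df_):
    """True if every non-missing OriginalZip is exactly five digits."""
    zips = df_['OriginalZip'].dropna()
    return bool(zips.str.fullmatch(r'\d{5}').all())

# synthetic examples
demo = pd.DataFrame({'OriginalZip': ['98107', '98109', None]})
print(all_zips_five_digits(demo))      # True

demo_bad = pd.DataFrame({'OriginalZip': ['98107', '74']})
print(all_zips_five_digits(demo_bad))  # False
```

Unlike the length check used above, the regex also rejects 5-character values that are not all digits.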
Let's check whether the zip codes belong to the Seattle area. The zip codes for comparison were taken from Zipcodes-us.com. Two zip codes do not belong to Seattle: 98019 (Duvall, WA) and 98004 (Bellevue, WA). Both addresses can be found in Seattle, and neither exists within its recorded zip code, so I change the zip codes to 98199 and 98115, respectively.
df_org['OriginalZip'].info()
<class 'pandas.core.series.Series'>
Index: 168017 entries, 0 to 168017
Series name: OriginalZip
Non-Null Count   Dtype
--------------   -----
157323 non-null  object
dtypes: object(1)
memory usage: 6.6+ MB
'''
Checks whether the zip codes belong to the Seattle area
@param df_ the dataframe
@return True if the dataframe has only Seattle zip codes,
        False if there are outsiders
'''
def check_zipcode_area(df_):
SEA_ZIPCODES = [98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98110, 98111, 98112, 98113, 98114, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124, 98125, 98126, 98127, 98129, 98131, 98133, 98134, 98136, 98138, 98139, 98141, 98144, 98145, 98146, 98148, 98154, 98155, 98158, 98160, 98161, 98164, 98165, 98166, 98168, 98170, 98174, 98175, 98177, 98178, 98181, 98185, 98188, 98190, 98191, 98194, 98195, 98198, 98199]
zcodes = set(pd.to_numeric(df_['OriginalZip'].dropna().unique()))
if (zcodes.issubset(SEA_ZIPCODES)):
print(f"Column has only Seattle zip codes: {zcodes}")
return True
    else:
        diff = zcodes.difference(SEA_ZIPCODES)
        print(f"There are zip codes that are not Seattle zipcodes: {diff}")
        for c in diff:
            outsiders = df_[df_['OriginalZip'] == str(c)]
            print(f"\nZip code {c} count: {len(outsiders)}")
            print(f"The records for {c}\n{outsiders.head()}")
        return False
check_zipcode_area(df_org)
There are zip codes that are not Seattle zipcodes: {98019, 98004}
Zip code 98019 count: 1
The records for 98019
PermitNum PermitClass PermitClassMapped PermitTypeMapped \
38131 6666330-CN Single Family/Duplex Residential Building
PermitTypeDesc \
38131 Addition/Alteration
Description \
38131 Interior alteration (kitchen remodel) with non load bearing wall removal subject to field inspection( STFI)
HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost \
38131 NaN NaN NaN 45000.0
AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent \
38131 2018-05-02 2018-05-02 2019-11-02 NaT Completed
RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip \
38131 NaN 2507 28TH AVE W SEATTLE WA 98019
ContractorCompanyName \
38131 NaN
Link \
38131 https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6666330-CN
Latitude Longitude Location1 CostBinned
38131 47.64147 -122.393351 (47.64147029, -122.39335124) <=100K
Zip code 98004 count: 1
The records for 98004
PermitNum PermitClass PermitClassMapped PermitTypeMapped \
38038 6667095-CN Commercial Non-Residential Building
PermitTypeDesc Description \
38038 Addition/Alteration Re-Roof commercial building for Mass & Mass.
HousingUnits HousingUnitsRemoved HousingUnitsAdded EstProjectCost \
38038 NaN NaN NaN 0.0
AppliedDate IssuedDate ExpiresDate CompletedDate StatusCurrent \
38038 2018-05-08 2018-05-08 2019-11-08 NaT Expired
RelatedMup OriginalAddress1 OriginalCity OriginalState OriginalZip \
38038 NaN 5844 NE 75TH ST SEATTLE WA 98004
ContractorCompanyName \
38038 NaN
Link \
38038 https://cosaccela.seattle.gov/portal/customize/LinkToRecord.aspx?altId=6667095-CN
Latitude Longitude Location1 CostBinned
38038 47.68342 -122.26404 (47.68341985, -122.26404013) <=100K
False
# corrections of the zip codes to Seattle codes from Duvall, and Bellevue
df_org.at[38131, 'OriginalZip'] = '98199'
df_org.at[38038, 'OriginalZip'] = '98115'
# again check the zip code area
check_zipcode_area(df_org)
Column has only Seattle zip codes: {98177, 98178, 98195, 98199, 98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98155, 98164, 98168}
True
According to the field descriptions, Location1 is a field that combines Latitude and Longitude. Let's check whether that is indeed the case. There is no discrepancy between Latitude/Longitude and the corresponding Location1: I found no differences whether the tolerance was set to 0.0001 or to 1e-09. At first glance, Latitude and Longitude appear to be displayed with a different precision than Location1, but on closer examination the precision is the same.
'''
Checks the consistency of the geolocation
@param df_ the dataframe
@return None
'''
def check_geo(df_):
    df = df_[['Latitude', 'Longitude', 'Location1']].copy()
    # for checking, strip the left and right parens and split on the comma
    df[['Lat', 'Lon']] = df['Location1'].str.replace('(', '').str.replace(')', '').str.split(',', n=1, expand=True)
    df['Lat'] = pd.to_numeric(df['Lat'])
    df['Lon'] = pd.to_numeric(df['Lon'])
    # take absolute differences (assigning back; a bare .abs() call is discarded)
    df['absLat'] = (df['Latitude'] - df['Lat']).abs()
    df['absLon'] = (df['Longitude'] - df['Lon']).abs()
    print(df.head())
def error(df1_, error_):
x = df1_[ (df1_['absLat'] >= error_) | (df1_['absLon'] >= error_) ]
print(f"Length of diff for Latitude and Longitude and Location1 >= {error_} = {len(x)}")
error(df, 0.0001)
error(df, 0.000000001)
check_geo(df_org)
   Latitude   Longitude                     Location1        Lat         Lon  \
0  47.663480 -122.371648  (47.66348025, -122.37164843)  47.663480 -122.371648
1  47.511472 -122.248589  (47.51147177, -122.24858896)  47.511472 -122.248589
2  47.684953 -122.362213  (47.68495344, -122.36221339)  47.684953 -122.362213
3  47.558581 -122.266645  (47.55858123, -122.26664479)  47.558581 -122.266645
4  47.610569 -122.342377  (47.61056886, -122.34237665)  47.610569 -122.342377

   absLat  absLon
0     0.0     0.0
1     0.0     0.0
2     0.0     0.0
3     0.0     0.0
4     0.0     0.0
Length of diff for Latitude and Longitude and Location1 >= 0.0001 = 0
Length of diff for Latitude and Longitude and Location1 >= 1e-09 = 0
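The same consistency check can be expressed vectorized with `numpy.isclose`. Here is a sketch on synthetic data; the regex extraction and the `geo_consistent` helper are my own hypothetical additions. Note `rtol=0`, so only the absolute tolerance applies:

```python
import numpy as np
import pandas as pd

def geo_consistent(df_, tol_=1e-9):
    """True if Location1 agrees with Latitude/Longitude within tol_."""
    # parse "(lat, lon)" into two float columns
    latlon = df_['Location1'].str.extract(r'\(([-\d.]+),\s*([-\d.]+)\)').astype(float)
    return bool(np.isclose(df_['Latitude'], latlon[0], rtol=0, atol=tol_).all()
                and np.isclose(df_['Longitude'], latlon[1], rtol=0, atol=tol_).all())

demo = pd.DataFrame({
    'Latitude':  [47.66348025, 47.51147177],
    'Longitude': [-122.37164843, -122.24858896],
    'Location1': ['(47.66348025, -122.37164843)', '(47.51147177, -122.24858896)'],
})
print(geo_consistent(demo))  # True
```

Setting `rtol=0` matters: with the default relative tolerance, a latitude near 47 would be accepted with a deviation of roughly 5e-4 even when `atol` is 1e-9.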
Now, let's look at whether the location values fall within the Seattle area. The visualization shows that there might be some concern regarding the geolocations, as the coordinates span a much larger area than Seattle. This needs to be taken into account when working with the data.
# center of the map
MAP_CENTER_LAT = 47.5
MAP_CENTER_LNG = -122.249
def plot_scatter_geo(df_, title_=""):
maxLat = df_['Latitude'].max()
minLat = df_['Latitude'].min()
maxLon = df_['Longitude'].max()
minLon = df_['Longitude'].min()
bounds = pd.DataFrame({
'Lat' : [ maxLat, maxLat, minLat, minLat ],
'Lon' : [ minLon, maxLon, minLon, maxLon ]
})
print(f"The boundary of locations in a dataframe (max and min values).\n{bounds}")
#print(df_)
#f = px.scatter_geo(bounds, lat = 'Lat', lon = 'Lon',
# #center = dict(lat = MAP_CENTER_LAT, lon = MAP_CENTER_LNG),
# scope='usa',
# mapbox_style = MAP_STYLE,
# title = title_)
f = px.density_mapbox(bounds, lat = 'Lat', lon = 'Lon', zoom=9,
mapbox_style=MAP_STYLE,
title = title_)
f.update_geos(fitbounds="locations")
return f
plot_scatter_geo(df_org, "Boundaries of Latitude and Longitude in the dataset.").show()
The boundary of locations in a dataframe (max and min values).
Lat Lon
0 47.733972 -122.430368
1 47.733972 -122.221100
2 47.495809 -122.430368
3 47.495809 -122.221100
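To quantify that concern, one could count the records falling outside an approximate Seattle bounding box. The bounds below are my rough assumption, not official city limits, and the helper is a sketch:

```python
import pandas as pd

# rough, assumed Seattle bounding box (NOT official city limits)
SEA_LAT_MIN, SEA_LAT_MAX = 47.48, 47.74
SEA_LON_MIN, SEA_LON_MAX = -122.44, -122.22

def outside_bbox(df_):
    """Return rows with coordinates outside the assumed bounding box."""
    inside = (df_['Latitude'].between(SEA_LAT_MIN, SEA_LAT_MAX)
              & df_['Longitude'].between(SEA_LON_MIN, SEA_LON_MAX))
    # keep only rows that have coordinates but fall outside the box
    return df_[df_['Latitude'].notna() & ~inside]

# synthetic example: the second point is well north of Seattle
demo = pd.DataFrame({'Latitude': [47.61, 48.50], 'Longitude': [-122.33, -122.33]})
print(len(outside_bbox(demo)))  # 1
```

A bounding box is crude (it includes water and some neighboring areas), but it is enough to flag clearly out-of-area coordinates for manual review.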
At this point, I consider the data clean enough to be used for further processing.
The data comes from a credible source and is a Public Domain dataset.
The data cleanup included:
# write to csv our clean data
result, mesg = ut.wrt_to_csv(df_org, DATA_PATH_PROCESSED)
print(f"Writing to file: {DATA_PATH_PROCESSED}.\nResult = {result}\nMessage={mesg}")
Writing to file: input/Building_Permits_20231024_clean.zip.
Result = False
Message=WARN: File exists in location: input/Building_Permits_20231024_clean.zip. No write ...
md(f"The dataset after processing can be found in {DATA_PATH_PROCESSED}.")
The dataset after processing can be found in input/Building_Permits_20231024_clean.zip.
The data should be read as follows:
df = pd.read_csv(file_location, thousands=',',
parse_dates= ['AppliedDate', 'IssuedDate', 'ExpiresDate', 'CompletedDate'],
dtype={'OriginalZip' : 'object'})
And the types of the current dataframe are as follows:
df_org.info()
<class 'pandas.core.frame.DataFrame'>
Index: 168017 entries, 0 to 168017
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   PermitNum              168017 non-null  object
 1   PermitClass            161577 non-null  object
 2   PermitClassMapped      161577 non-null  object
 3   PermitTypeMapped       168017 non-null  object
 4   PermitTypeDesc         157436 non-null  object
 5   Description            167730 non-null  object
 6   HousingUnits           26697 non-null   float64
 7   HousingUnitsRemoved    79331 non-null   float64
 8   HousingUnitsAdded      79331 non-null   float64
 9   EstProjectCost         133260 non-null  float64
 10  AppliedDate            129286 non-null  datetime64[ns]
 11  IssuedDate             122470 non-null  datetime64[ns]
 12  ExpiresDate            122626 non-null  datetime64[ns]
 13  CompletedDate          84417 non-null   datetime64[ns]
 14  StatusCurrent          168017 non-null  object
 15  RelatedMup             13688 non-null   object
 16  OriginalAddress1       167580 non-null  object
 17  OriginalCity           161380 non-null  object
 18  OriginalState          161469 non-null  object
 19  OriginalZip            157323 non-null  object
 20  ContractorCompanyName  28358 non-null   object
 21  Link                   168017 non-null  object
 22  Latitude               165228 non-null  float64
 23  Longitude              165228 non-null  float64
 24  Location1              165228 non-null  object
 25  CostBinned             133260 non-null  category
dtypes: category(1), datetime64[ns](4), float64(6), object(15)
memory usage: 37.5+ MB
And the types of the dataframe after reading the cleaned, compressed file back are as follows:
df_clean = ut.read_to_df(DATA_PATH_PROCESSED, False)
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168017 entries, 0 to 168016
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   PermitNum              168017 non-null  object
 1   PermitClass            161577 non-null  object
 2   PermitClassMapped      161577 non-null  object
 3   PermitTypeMapped       168017 non-null  object
 4   PermitTypeDesc         157436 non-null  object
 5   Description            167730 non-null  object
 6   HousingUnits           26697 non-null   float64
 7   HousingUnitsRemoved    79331 non-null   float64
 8   HousingUnitsAdded      79331 non-null   float64
 9   EstProjectCost         133260 non-null  float64
 10  AppliedDate            129286 non-null  datetime64[ns]
 11  IssuedDate             122470 non-null  datetime64[ns]
 12  ExpiresDate            122626 non-null  datetime64[ns]
 13  CompletedDate          84417 non-null   datetime64[ns]
 14  StatusCurrent          168017 non-null  object
 15  RelatedMup             13688 non-null   object
 16  OriginalAddress1       167580 non-null  object
 17  OriginalCity           161380 non-null  object
 18  OriginalState          161469 non-null  object
 19  OriginalZip            157323 non-null  object
 20  ContractorCompanyName  28358 non-null   object
 21  Link                   168017 non-null  object
 22  Latitude               165228 non-null  float64
 23  Longitude              165228 non-null  float64
 24  Location1              165228 non-null  object
 25  CostBinned             133260 non-null  object
dtypes: datetime64[ns](4), float64(6), object(16)
memory usage: 33.3+ MB
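As the info() output shows, the CSV round trip loses the category dtype of CostBinned (it comes back as object). A hedged sketch of re-applying it after reading; `restore_dtypes` is a hypothetical helper, not part of myutils:

```python
import pandas as pd

def restore_dtypes(df_):
    """Re-apply dtypes that do not survive a CSV round trip."""
    df_['CostBinned'] = df_['CostBinned'].astype('category')
    return df_

# synthetic example with string bin labels, as they come back from the CSV
demo = pd.DataFrame({'CostBinned': ['<=100K', '(1M, 3M]', '<=100K']})
demo = restore_dtypes(demo)
print(demo['CostBinned'].dtype)  # category
```

Note that a plain `astype('category')` infers unordered categories from the data; if the original bins were ordered, an explicit `pd.CategoricalDtype` with the bin labels in order would be needed.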
'''
Quick test for values in a dataframe
@param df_ The dataframe
@return None
'''
def test_values(df_):
test_str = '\n***** TEST *****\n'
def test_passed(msg_='PASSED'):
print(msg_)
print(f'{test_str}There should be only names or "nan": ')
print_uniq_vals(df_, ['PermitClass', 'PermitClassMapped', 'PermitTypeMapped',
'PermitTypeDesc', 'StatusCurrent', 'OriginalCity',
'OriginalState'])
test_passed("ATTENTION: TEST RESULTS (ABOVE) NEED MANUAL VERIFICATION")
# check uppercase letters in 'OriginalAddress1'
print(f'{test_str}Checking whether OriginalAddress1 is in uppercase: ')
assert (df_['OriginalAddress1'].str.upper().equals(df_['OriginalAddress1']))
test_passed()
# check if there are any duplicates
print(f'{test_str}Checking duplicates in PermitNum...')
assert(len(ut.are_there_duplicates(df_)) == 0)
test_passed()
# check correction of the number of housing units
print(f"{test_str}Checking if HousingUnits column was corrected ...")
assert (len(df_[df_['HousingUnits'] > 10000]) == 0)
test_passed()
# check if we have only SEATTLE locations
print(f"{test_str}Checking if OriginalCity was corrected ...")
assert (len(df_[df_['OriginalCity'].notnull() & (df_['OriginalCity'] != 'SEATTLE') ]) == 0)
test_passed()
# check the zip code area if they are only from Seattle
print(f"{test_str}Checking if zip codes are from Seattle ...")
assert (check_zipcode_area(df_) == True)
test_passed()
print(f"\n\n************ TESTS SUMMARY *****************")
print(f"Check manually results of the first test.")
print("All automatic tests: PASSED")
# performing final tests
print('Performing final tests on a cleaned dataframe....')
test_values(df_clean)
Performing final tests on a cleaned dataframe....
***** TEST *****
There should be only names or "nan":
---------- PermitClass: ['Industrial' 'Vacant Land' 'Single Family/Duplex' 'Multifamily'
'Commercial' 'Institutional' nan]
---------- PermitClassMapped: ['Non-Residential' 'Residential' nan]
---------- PermitTypeMapped: ['Building' 'Demolition' 'Grading' 'Roof'
'ECA and Shoreline Exemption/Street Improvement Exception Request']
---------- PermitTypeDesc: ['New' 'Demolition' 'Addition/Alteration' 'Tenant Improvment' nan
'Change of Use Only - No Construction' 'Deconstruction' 'Temporary'
'Relocation' 'Curb Cut' 'Environmentally Critical Area Exemption'
'Shoreline Exemption' 'Tenant Improvement Pre-Approval'
'Street Improvement Exception' 'Relief from Prohibition on Steep Slope'
'Modification to Submittal Requirements' 'Shoreline Permit Exemption'
'ECA Small Project Waiver']
---------- StatusCurrent: ['Application Completed' 'Awaiting Information' 'Reviews In Process'
'Ready for Issuance' 'Issued' 'Reviews Completed' 'Corrections Required'
'Completed' 'Corrections Submitted' 'Withdrawn' 'Inspections Completed'
'Denied' 'Additional Info Requested' 'Approved to Occupy' 'Expired'
'Phase Issued' 'Closed' 'Scheduled' 'Canceled' 'Active' 'Initiated'
'Ready for Intake' 'Scheduled and Submitted' 'Pending']
---------- OriginalCity: ['SEATTLE' nan]
---------- OriginalState: ['WA' nan]
ATTENTION: TEST RESULTS (ABOVE) NEED MANUAL VERIFICATION
***** TEST *****
Checking whether OriginalAddress1 is in uppercase:
PASSED
***** TEST *****
Checking duplicates in PermitNum...
PASSED
***** TEST *****
Checking if HousingUnits column was corrected ...
PASSED
***** TEST *****
Checking if OriginalCity was corrected ...
PASSED
***** TEST *****
Checking if zip codes are from Seattle ...
Column has only Seattle zip codes: {98177, 98178, 98195, 98199, 98101, 98102, 98103, 98104, 98105, 98106, 98107, 98108, 98109, 98112, 98115, 98116, 98117, 98118, 98119, 98121, 98122, 98124, 98125, 98126, 98133, 98134, 98136, 98144, 98146, 98154, 98155, 98164, 98168}
PASSED
************ TESTS SUMMARY *****************
Check manually results of the first test.
All automatic tests: PASSED
To export externally and hide the code cells in the exported file, run:
$ jupyter nbconvert --to html --no-input file.ipynb
[1] Seattle Department of Construction and Inspections, Seattle Open Data – Building Permits. City of Seattle, 2023. [Online]. Available: https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr
[2] City of Seattle, Permit and Property Records. [Online]. Available: https://web.seattle.gov/dpd/edms/